24 IM939 - Lab 7 Part 1

import pandas as pd
df = pd.read_excel('data/hate_Crimes_v2.xlsx')
In session 7 (week 8) we discussed data and society: academic and practitioner discourse on the social, political and ethical aspects of data science. We discussed how one can responsibly carry out data science research on social phenomena, which ethical and social frameworks can help us critically approach data science practices and their effects on society, and what ethical practice looks like for data scientists.
24.0.1 Datasets
- Hate crimes (CSV): https://fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/
- OECD Poverty gap (CSV): https://data.oecd.org/inequality/poverty-rate.htm
- OECD Income inequality: https://data.oecd.org/inequality/income-inequality.htm#indicator-chart
- Poverty & Equity Data Portal: https://povertydata.worldbank.org/poverty/home/
24.0.2 Further datasets
- NHS (multiple files): The NHS inequality challenge https://www.nuffieldtrust.org.uk/project/nhs-visual-data-challenge
ONS
- Gender Pay Gap https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/datasets/annualsurveyofhoursandearningsashegenderpaygaptables
- Health state life expectancies by Index of Multiple Deprivation (IMD 2015 and IMD 2019): England, all ages (multiple publications) https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/healthinequalities/datasets/healthstatelifeexpectanciesbyindexofmultipledeprivationimdenglandallages
24.0.3 Additional Readings
Indicators - critical reviews: The Poverty of Statistics and the Statistics of Poverty https://www.tandfonline.com/doi/full/10.1080/01436590903321844?src=recsys
Indicators in global health arguments: indicators are usually comprehensible only to a small group of experts. Why use indicators then? "Because indicators used in global HIV finance offer openings for engagement to promote accountability (…) some indicators and data truly are better than others, and as they were all created by humans, they all can be deconstructed and remade in other forms." Davis, S. (2020). The Uncounted: Politics of Data in Global Health. Cambridge University Press. doi:10.1017/9781108649544
Indicators - conceptualization
24.1 1 Hate Crimes
24.1.1 Source:
https://github.com/fivethirtyeight/data/tree/master/hate-crimes
24.1.2 Variables:
| Header | Definition |
|---|---|
| state | State name |
| median_household_income | Median household income, 2016 |
| share_unemployed_seasonal | Share of the population that is unemployed (seasonally adjusted), Sept. 2016 |
| share_population_in_metro_areas | Share of the population that lives in metropolitan areas, 2015 |
| share_population_with_high_school_degree | Share of adults 25 and older with a high-school degree, 2009 |
| share_non_citizen | Share of the population that are not U.S. citizens, 2015 |
| share_white_poverty | Share of white residents who are living in poverty, 2015 |
| gini_index | Gini Index, 2015 |
| share_non_white | Share of the population that is not white, 2015 |
| share_voters_voted_trump | Share of 2016 U.S. presidential voters who voted for Donald Trump |
| hate_crimes_per_100k_splc | Hate crimes per 100,000 population, Southern Poverty Law Center, Nov. 9-18, 2016 |
| avg_hatecrimes_per_100k_fbi | Average annual hate crimes per 100,000 population, FBI, 2010-2015 |
24.2 2 Reading the dataset
A reminder: anything with a `pd.` prefix comes from pandas. Keeping a module's functions behind a prefix like this is particularly useful for preventing the module from overwriting (shadowing) built-in Python functionality.
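As a quick toy illustration (not part of the lab data): because pandas' names live behind the `pd.` prefix, Python's built-ins such as `sum()` stay untouched.

```python
import pandas as pd

values = [1, 2, 3]

# The built-in sum() remains available because pandas' functionality
# sits behind the pd. prefix rather than being imported star-style.
print(sum(values))               # -> 6 (Python built-in)
print(pd.Series(values).sum())   # -> 6 (pandas method)
```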
Let’s have a look at our dataset
df.tail()

| | NAME | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46 | Virginia | 66155 | 0.043 | 0.89 | 0.866 | 0.06 | 0.07 | 0.459 | 0.38 | 0.45 | 0.36 | 1.72 |
| 47 | Washington | 59068 | 0.052 | 0.86 | 0.897 | 0.08 | 0.09 | 0.441 | 0.31 | 0.38 | 0.67 | 3.81 |
| 48 | West Virginia | 39552 | 0.073 | 0.55 | 0.828 | 0.01 | 0.14 | 0.451 | 0.07 | 0.69 | 0.32 | 2.03 |
| 49 | Wisconsin | 58080 | 0.043 | 0.69 | 0.898 | 0.03 | 0.09 | 0.430 | 0.22 | 0.48 | 0.22 | 1.12 |
| 50 | Wyoming | 55690 | 0.040 | 0.31 | 0.918 | 0.02 | 0.09 | 0.423 | 0.15 | 0.70 | 0.00 | 0.26 |
type(df)

pandas.core.frame.DataFrame

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NAME 51 non-null object
1 median_household_income 51 non-null int64
2 share_unemployed_seasonal 51 non-null float64
3 share_population_in_metro_areas 51 non-null float64
4 share_population_with_high_school_degree 51 non-null float64
5 share_non_citizen 48 non-null float64
6 share_white_poverty 51 non-null float64
7 gini_index 51 non-null float64
8 share_non_white 51 non-null float64
9 share_voters_voted_trump 51 non-null float64
10 hate_crimes_per_100k_splc 51 non-null float64
11 avg_hatecrimes_per_100k_fbi 51 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 4.9+ KB
24.3 3 Exploring data
24.3.1 Missing values
Let’s explore the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NAME 51 non-null object
1 median_household_income 51 non-null int64
2 share_unemployed_seasonal 51 non-null float64
3 share_population_in_metro_areas 51 non-null float64
4 share_population_with_high_school_degree 51 non-null float64
5 share_non_citizen 48 non-null float64
6 share_white_poverty 51 non-null float64
7 gini_index 51 non-null float64
8 share_non_white 51 non-null float64
9 share_voters_voted_trump 51 non-null float64
10 hate_crimes_per_100k_splc 51 non-null float64
11 avg_hatecrimes_per_100k_fbi 51 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 4.9+ KB
The table above shows that we have some missing data for some of the states. See below too.
df.isna().sum()

NAME 0
median_household_income 0
share_unemployed_seasonal 0
share_population_in_metro_areas 0
share_population_with_high_school_degree 0
share_non_citizen 3
share_white_poverty 0
gini_index 0
share_non_white 0
share_voters_voted_trump 0
hate_crimes_per_100k_splc 0
avg_hatecrimes_per_100k_fbi 0
dtype: int64
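Before deciding what to do about the three missing `share_non_citizen` values, it may help to see the two usual options on a toy frame (the column names only echo the lab data):

```python
import pandas as pd
import numpy as np

# Toy frame with one missing value
toy = pd.DataFrame({'state': ['A', 'B', 'C'],
                    'share_non_citizen': [0.05, np.nan, 0.08]})

# Option 1: drop any row containing a missing value
print(len(toy.dropna()))  # -> 2

# Option 2: impute, e.g. with the column median
filled = toy.fillna({'share_non_citizen': toy['share_non_citizen'].median()})
print(filled['share_non_citizen'].isna().sum())  # -> 0
```

Either choice shapes later statistics, so it is worth stating explicitly which one the analysis uses.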
import numpy as np
np.unique(df.NAME)

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
'New Jersey', 'New Mexico', 'New York', 'North Carolina',
'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)
There aren’t any unexpected values in ‘state’.
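A sketch of how such a check could be automated, using a made-up expected set and a deliberate typo (none of these values come from the lab data):

```python
import pandas as pd

# Hypothetical expected set and a column containing a deliberate typo
expected = {'Texas', 'Utah', 'Vermont'}
found = pd.Series(['Texas', 'Utah', 'Texsa'])

# Set difference surfaces any value we did not anticipate
unexpected = set(found.unique()) - expected
print(unexpected)  # -> {'Texsa'}
```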
24.4 Mapping hate crime across the USA
#using James' code from the last lab: we need the geospatial polygons of the states in America
import geopandas as gpd
import pandas as pd
import altair as alt
geo_states = gpd.read_file('data/gz_2010_us_040_00_500k.json')
#df = pd.read_excel('data/hate_Crimes_v2.xlsx')
geo_states.head()

| | GEO_ID | STATE | NAME | LSAD | CENSUSAREA | geometry |
|---|---|---|---|---|---|---|
| 0 | 0400000US23 | 23 | Maine | | 30842.923 | MULTIPOLYGON (((-67.61976 44.51975, -67.61541 ... |
| 1 | 0400000US25 | 25 | Massachusetts | | 7800.058 | MULTIPOLYGON (((-70.83204 41.60650, -70.82373 ... |
| 2 | 0400000US26 | 26 | Michigan | | 56538.901 | MULTIPOLYGON (((-88.68443 48.11579, -88.67563 ... |
| 3 | 0400000US30 | 30 | Montana | | 145545.801 | POLYGON ((-104.05770 44.99743, -104.25015 44.9... |
| 4 | 0400000US32 | 32 | Nevada | | 109781.180 | POLYGON ((-114.05060 37.00040, -114.04999 36.9... |
alt.Chart(geo_states, title='US states').mark_geoshape().encode(
).properties(
width=500,
height=300
).project(
type='albersUsa'
)

# Add the data
# note: the hate-crimes data already uses 'NAME' as the state column, matching geo_states
geo_states = geo_states.merge(df, on='NAME')
geo_states.head()

| | GEO_ID | STATE | NAME | LSAD | CENSUSAREA | geometry | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0400000US23 | 23 | Maine | | 30842.923 | MULTIPOLYGON (((-67.61976 44.51975, -67.61541 ... | 51710 | 0.044 | 0.54 | 0.902 | NaN | 0.12 | 0.437 | 0.09 | 0.45 | 0.61 | 2.62 |
| 1 | 0400000US25 | 25 | Massachusetts | | 7800.058 | MULTIPOLYGON (((-70.83204 41.60650, -70.82373 ... | 63151 | 0.046 | 0.97 | 0.890 | 0.09 | 0.08 | 0.475 | 0.27 | 0.34 | 0.63 | 4.80 |
| 2 | 0400000US26 | 26 | Michigan | | 56538.901 | MULTIPOLYGON (((-88.68443 48.11579, -88.67563 ... | 52005 | 0.050 | 0.87 | 0.879 | 0.04 | 0.09 | 0.451 | 0.24 | 0.48 | 0.40 | 3.20 |
| 3 | 0400000US30 | 30 | Montana | | 145545.801 | POLYGON ((-104.05770 44.99743, -104.25015 44.9... | 51102 | 0.041 | 0.34 | 0.908 | 0.01 | 0.10 | 0.435 | 0.10 | 0.57 | 0.49 | 2.95 |
| 4 | 0400000US32 | 32 | Nevada | | 109781.180 | POLYGON ((-114.05060 37.00040, -114.04999 36.9... | 49875 | 0.067 | 0.87 | 0.839 | 0.10 | 0.08 | 0.448 | 0.50 | 0.46 | 0.14 | 2.11 |
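One way to sanity-check a merge like this is pandas' `indicator=True` option, which flags rows that matched on only one side. A sketch with made-up state names rather than the lab data:

```python
import pandas as pd

# Made-up frames standing in for geo_states and df
left = pd.DataFrame({'NAME': ['Maine', 'Nevada', 'Puerto Rico']})
right = pd.DataFrame({'NAME': ['Maine', 'Nevada'], 'value': [1, 2]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = left.merge(right, on='NAME', how='outer', indicator=True)
print(merged.loc[merged['_merge'] != 'both', 'NAME'].tolist())  # -> ['Puerto Rico']
```

The default inner merge silently drops unmatched rows, so a check like this helps confirm no states were lost on the way to the map.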
alt.Chart(geo_states, title='PRE-election Hate crime per 100k').mark_geoshape().encode(
color='avg_hatecrimes_per_100k_fbi',
tooltip=['NAME', 'avg_hatecrimes_per_100k_fbi']
).properties(
width=500,
height=300
).project(
type='albersUsa'
)

alt.Chart(geo_states, title='POST-election Hate crime per 100k').mark_geoshape().encode(
color='hate_crimes_per_100k_splc',
tooltip=['NAME', 'hate_crimes_per_100k_splc']
).properties(
width=500,
height=300
).project(
type='albersUsa'
)

24.5 Exploring data
import seaborn as sns
sns.pairplot(data = df.iloc[:,1:])
df.boxplot(column=['median_household_income'])

<Axes: >

df.boxplot(column=['avg_hatecrimes_per_100k_fbi'])

<Axes: >

We may want to drop (remove) columns or rows. Details are in the pandas documentation for `DataFrame.drop`.
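A minimal sketch of `drop` on a toy frame, covering both the column-wise and row-wise forms:

```python
import pandas as pd

toy = pd.DataFrame({'state': ['A', 'B'], 'x': [1, 2], 'y': [3, 4]})

# Drop a column by name...
print(toy.drop(columns=['y']).columns.tolist())  # -> ['state', 'x']

# ...or a row by index label (drop returns a copy by default)
print(len(toy.drop(index=0)))  # -> 1
```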
Let us drop Hawaii.
df[df.NAME == 'Hawaii']

| | NAME | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | Hawaii | 71223 | 0.034 | 0.76 | 0.904 | 0.08 | 0.07 | 0.433 | 0.81 | 0.3 | 0.0 | 0.0 |
df = df.drop(df.index[11])
df.describe()

| | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 47.000000 | 50.000000 | 50.000000 | 50.000000 | 50.00000 | 50.000000 | 50.000000 |
| mean | 54903.620000 | 0.049880 | 0.750000 | 0.868420 | 0.054043 | 0.092200 | 0.454180 | 0.305800 | 0.49380 | 0.281200 | 2.363200 |
| std | 9010.994814 | 0.010571 | 0.183425 | 0.034049 | 0.031184 | 0.024767 | 0.020889 | 0.150551 | 0.11674 | 0.255779 | 1.714502 |
| min | 35521.000000 | 0.028000 | 0.310000 | 0.799000 | 0.010000 | 0.040000 | 0.419000 | 0.060000 | 0.04000 | 0.000000 | 0.260000 |
| 25% | 48358.500000 | 0.042250 | 0.630000 | 0.839750 | 0.030000 | 0.080000 | 0.440000 | 0.192500 | 0.42000 | 0.130000 | 1.290000 |
| 50% | 54613.000000 | 0.051000 | 0.790000 | 0.874000 | 0.040000 | 0.090000 | 0.454500 | 0.275000 | 0.49500 | 0.215000 | 1.980000 |
| 75% | 60652.750000 | 0.057750 | 0.897500 | 0.897750 | 0.080000 | 0.100000 | 0.466750 | 0.420000 | 0.57750 | 0.345000 | 3.182500 |
| max | 76165.000000 | 0.073000 | 1.000000 | 0.918000 | 0.130000 | 0.170000 | 0.532000 | 0.630000 | 0.70000 | 1.520000 | 10.950000 |
df.plot(x = 'avg_hatecrimes_per_100k_fbi', y = 'median_household_income', kind='scatter')

<Axes: xlabel='avg_hatecrimes_per_100k_fbi', ylabel='median_household_income'>

df.plot(x = 'hate_crimes_per_100k_splc', y = 'median_household_income', kind='scatter')

<Axes: xlabel='hate_crimes_per_100k_splc', ylabel='median_household_income'>

df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]

| | NAME | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | District of Columbia | 68277 | 0.067 | 1.00 | 0.871 | 0.11 | 0.04 | 0.532 | 0.63 | 0.04 | 1.52 | 10.95 |
| 37 | Oregon | 58875 | 0.062 | 0.87 | 0.891 | 0.07 | 0.10 | 0.449 | 0.26 | 0.41 | 0.83 | 3.39 |
| 47 | Washington | 59068 | 0.052 | 0.86 | 0.897 | 0.08 | 0.09 | 0.441 | 0.31 | 0.38 | 0.67 | 3.81 |
import matplotlib.pyplot as plt
outliers_df = df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]
df.plot(x = 'hate_crimes_per_100k_splc', y = 'median_household_income', kind='scatter')
plt.scatter(outliers_df.hate_crimes_per_100k_splc, outliers_df.median_household_income, c='red')

<matplotlib.collections.PathCollection at 0x16eb69e90>

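Note that the filter above compares raw values to 2.5 standard deviations without subtracting the mean; a more conventional variant uses z-scores. A sketch on toy data (not the lab dataset):

```python
import pandas as pd

# Toy series with one clear outlier
s = pd.Series([0.1] * 9 + [2.0])

# z-score: distance from the mean in (population) standard deviations
z = (s - s.mean()) / s.std(ddof=0)
print(s[z.abs() > 2.5].tolist())  # -> [2.0]
```

For the hate-crimes column the two rules happen to flag similar states, but the z-score version is the one that generalises to columns whose mean is far from zero.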
df_pivot = df.pivot_table(index=['NAME'], values=['hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi', 'median_household_income'])
df_pivot
# sort by values
#df_pivot = pd.pivot_table(df, index=['state'], columns = ['hate_crimes_per_100k_splc'], fill_value=0)
#df_pivot
#df2 = df_pivot.reindex(df_pivot['hate_crimes_per_100k_splc'].sort_values(by='hate_crimes_per_100k_splc', ascending=False).index)

| | avg_hatecrimes_per_100k_fbi | hate_crimes_per_100k_splc | median_household_income |
|---|---|---|---|
| NAME | |||
| Alabama | 1.80 | 0.12 | 42278 |
| Alaska | 1.65 | 0.14 | 67629 |
| Arizona | 3.41 | 0.22 | 49254 |
| Arkansas | 0.86 | 0.06 | 44922 |
| California | 2.39 | 0.25 | 60487 |
| Colorado | 2.80 | 0.39 | 60940 |
| Connecticut | 3.77 | 0.33 | 70161 |
| Delaware | 1.46 | 0.32 | 57522 |
| District of Columbia | 10.95 | 1.52 | 68277 |
| Florida | 0.69 | 0.18 | 46140 |
| Georgia | 0.41 | 0.12 | 49555 |
| Idaho | 1.89 | 0.12 | 53438 |
| Illinois | 1.04 | 0.19 | 54916 |
| Indiana | 1.75 | 0.24 | 48060 |
| Iowa | 0.56 | 0.45 | 57810 |
| Kansas | 2.14 | 0.10 | 53444 |
| Kentucky | 4.20 | 0.32 | 42786 |
| Louisiana | 1.34 | 0.10 | 42406 |
| Maine | 2.62 | 0.61 | 51710 |
| Maryland | 1.32 | 0.37 | 76165 |
| Massachusetts | 4.80 | 0.63 | 63151 |
| Michigan | 3.20 | 0.40 | 52005 |
| Minnesota | 3.61 | 0.62 | 67244 |
| Mississippi | 0.62 | 0.06 | 35521 |
| Missouri | 1.90 | 0.18 | 56630 |
| Montana | 2.95 | 0.49 | 51102 |
| Nebraska | 2.68 | 0.15 | 56870 |
| Nevada | 2.11 | 0.14 | 49875 |
| New Hampshire | 2.10 | 0.15 | 73397 |
| New Jersey | 4.41 | 0.07 | 65243 |
| New Mexico | 1.88 | 0.29 | 46686 |
| New York | 3.10 | 0.35 | 54310 |
| North Carolina | 1.26 | 0.24 | 46784 |
| North Dakota | 4.74 | 0.00 | 60730 |
| Ohio | 3.24 | 0.19 | 49644 |
| Oklahoma | 1.08 | 0.13 | 47199 |
| Oregon | 3.39 | 0.83 | 58875 |
| Pennsylvania | 0.43 | 0.28 | 55173 |
| Rhode Island | 1.28 | 0.09 | 58633 |
| South Carolina | 1.93 | 0.20 | 44929 |
| South Dakota | 3.30 | 0.00 | 53053 |
| Tennessee | 3.13 | 0.19 | 43716 |
| Texas | 0.75 | 0.21 | 53875 |
| Utah | 2.38 | 0.13 | 63383 |
| Vermont | 1.90 | 0.32 | 60708 |
| Virginia | 1.72 | 0.36 | 66155 |
| Washington | 3.81 | 0.67 | 59068 |
| West Virginia | 2.03 | 0.32 | 39552 |
| Wisconsin | 1.12 | 0.22 | 58080 |
| Wyoming | 0.26 | 0.00 | 55690 |
df_pivot.sort_values(by=['avg_hatecrimes_per_100k_fbi'], ascending=False)

| | avg_hatecrimes_per_100k_fbi | hate_crimes_per_100k_splc | median_household_income |
|---|---|---|---|
| NAME | |||
| District of Columbia | 10.95 | 1.52 | 68277 |
| Massachusetts | 4.80 | 0.63 | 63151 |
| North Dakota | 4.74 | 0.00 | 60730 |
| New Jersey | 4.41 | 0.07 | 65243 |
| Kentucky | 4.20 | 0.32 | 42786 |
| Washington | 3.81 | 0.67 | 59068 |
| Connecticut | 3.77 | 0.33 | 70161 |
| Minnesota | 3.61 | 0.62 | 67244 |
| Arizona | 3.41 | 0.22 | 49254 |
| Oregon | 3.39 | 0.83 | 58875 |
| South Dakota | 3.30 | 0.00 | 53053 |
| Ohio | 3.24 | 0.19 | 49644 |
| Michigan | 3.20 | 0.40 | 52005 |
| Tennessee | 3.13 | 0.19 | 43716 |
| New York | 3.10 | 0.35 | 54310 |
| Montana | 2.95 | 0.49 | 51102 |
| Colorado | 2.80 | 0.39 | 60940 |
| Nebraska | 2.68 | 0.15 | 56870 |
| Maine | 2.62 | 0.61 | 51710 |
| California | 2.39 | 0.25 | 60487 |
| Utah | 2.38 | 0.13 | 63383 |
| Kansas | 2.14 | 0.10 | 53444 |
| Nevada | 2.11 | 0.14 | 49875 |
| New Hampshire | 2.10 | 0.15 | 73397 |
| West Virginia | 2.03 | 0.32 | 39552 |
| South Carolina | 1.93 | 0.20 | 44929 |
| Vermont | 1.90 | 0.32 | 60708 |
| Missouri | 1.90 | 0.18 | 56630 |
| Idaho | 1.89 | 0.12 | 53438 |
| New Mexico | 1.88 | 0.29 | 46686 |
| Alabama | 1.80 | 0.12 | 42278 |
| Indiana | 1.75 | 0.24 | 48060 |
| Virginia | 1.72 | 0.36 | 66155 |
| Alaska | 1.65 | 0.14 | 67629 |
| Delaware | 1.46 | 0.32 | 57522 |
| Louisiana | 1.34 | 0.10 | 42406 |
| Maryland | 1.32 | 0.37 | 76165 |
| Rhode Island | 1.28 | 0.09 | 58633 |
| North Carolina | 1.26 | 0.24 | 46784 |
| Wisconsin | 1.12 | 0.22 | 58080 |
| Oklahoma | 1.08 | 0.13 | 47199 |
| Illinois | 1.04 | 0.19 | 54916 |
| Arkansas | 0.86 | 0.06 | 44922 |
| Texas | 0.75 | 0.21 | 53875 |
| Florida | 0.69 | 0.18 | 46140 |
| Mississippi | 0.62 | 0.06 | 35521 |
| Iowa | 0.56 | 0.45 | 57810 |
| Pennsylvania | 0.43 | 0.28 | 55173 |
| Georgia | 0.41 | 0.12 | 49555 |
| Wyoming | 0.26 | 0.00 | 55690 |
# This is code for standardization
from sklearn import preprocessing
import numpy as np
#Get column names first
#names = df.columns
#df_stand = df[['median_household_income','share_unemployed_seasonal']]
df_stand = df[['median_household_income','share_unemployed_seasonal', 'share_population_in_metro_areas'
, 'share_population_with_high_school_degree', 'share_non_citizen', 'share_white_poverty', 'gini_index'
, 'share_non_white', 'share_voters_voted_trump', 'hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi']]
names = df_stand.columns
#Create the Scaler object
scaler = preprocessing.StandardScaler()
#Fit your data on the scaler object
df2 = scaler.fit_transform(df_stand)
df2 = pd.DataFrame(df2, columns=names)
df2.tail()

| | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | 1.261305 | -0.657461 | 0.771002 | -0.071795 | 0.193108 | -0.905436 | 0.233085 | 0.497859 | -0.379003 | 0.311206 | -0.378961 |
| 46 | 0.466836 | 0.202590 | 0.605787 | 0.847894 | 0.841399 | -0.089728 | -0.637357 | 0.028181 | -0.984716 | 1.535493 | 0.852428 |
| 47 | -1.720951 | 2.209376 | -1.101431 | -1.199157 | -1.427620 | 1.949543 | -0.153778 | -1.582146 | 1.697727 | 0.153233 | -0.196315 |
| 48 | 0.356079 | -0.657461 | -0.330429 | 0.877562 | -0.779329 | -0.089728 | -1.169293 | -0.575692 | -0.119412 | -0.241698 | -0.732470 |
| 49 | 0.088155 | -0.944145 | -2.423149 | 1.470910 | -1.103475 | -0.089728 | -1.507798 | -1.045370 | 1.784258 | -1.110547 | -1.239166 |
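A quick sanity check that `StandardScaler` behaves as expected, on toy data rather than the lab dataset: after scaling, each column should have mean approximately 0 and (population) standard deviation approximately 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = StandardScaler().fit_transform(x)

# Mean ~0 and (population) standard deviation ~1 after scaling
print(abs(scaled.mean()) < 1e-9)       # -> True
print(abs(scaled.std() - 1.0) < 1e-9)  # -> True
```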
ax = sns.boxplot(data=df2, orient="h")
# wanted to remove the row with Hawaii (row 11), following https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/
df2 = df.copy()
df2
#df2.drop('Hawaii')
#df2.drop(11) # drop the Hawaii row
# note: Hawaii was already dropped from df above, so df.index[11] now points at a
# different state; also, drop() returns a copy, so without reassignment df2 is unchanged
df2.drop(df.index[11])
df2.tail()

| | NAME | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46 | Virginia | 66155 | 0.043 | 0.89 | 0.866 | 0.06 | 0.07 | 0.459 | 0.38 | 0.45 | 0.36 | 1.72 |
| 47 | Washington | 59068 | 0.052 | 0.86 | 0.897 | 0.08 | 0.09 | 0.441 | 0.31 | 0.38 | 0.67 | 3.81 |
| 48 | West Virginia | 39552 | 0.073 | 0.55 | 0.828 | 0.01 | 0.14 | 0.451 | 0.07 | 0.69 | 0.32 | 2.03 |
| 49 | Wisconsin | 58080 | 0.043 | 0.69 | 0.898 | 0.03 | 0.09 | 0.430 | 0.22 | 0.48 | 0.22 | 1.12 |
| 50 | Wyoming | 55690 | 0.040 | 0.31 | 0.918 | 0.02 | 0.09 | 0.423 | 0.15 | 0.70 | 0.00 | 0.26 |
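The cell above illustrates a common pitfall: `drop()` returns a new DataFrame rather than modifying in place, so `df2` still contains every row. A minimal sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30]})

toy.drop(1)        # returns a new frame; toy itself is unchanged
print(len(toy))    # -> 3

toy = toy.drop(1)  # reassign (or pass inplace=True) to keep the result
print(len(toy))    # -> 2
```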
import scipy.stats
#instead of running it one by one for every pair of variables, like:
#scipy.stats.pearsonr(st_wine.quality.values, st_wine.alcohol.values)
corrMatrix = df2.corr(numeric_only=True).round(2)
print (corrMatrix)

median_household_income \
median_household_income 1.00
share_unemployed_seasonal -0.34
share_population_in_metro_areas 0.29
share_population_with_high_school_degree 0.64
share_non_citizen 0.28
share_white_poverty -0.82
gini_index -0.15
share_non_white -0.00
share_voters_voted_trump -0.57
hate_crimes_per_100k_splc 0.33
avg_hatecrimes_per_100k_fbi 0.32
share_unemployed_seasonal \
median_household_income -0.34
share_unemployed_seasonal 1.00
share_population_in_metro_areas 0.37
share_population_with_high_school_degree -0.61
share_non_citizen 0.31
share_white_poverty 0.19
gini_index 0.53
share_non_white 0.59
share_voters_voted_trump -0.21
hate_crimes_per_100k_splc 0.18
avg_hatecrimes_per_100k_fbi 0.07
share_population_in_metro_areas \
median_household_income 0.29
share_unemployed_seasonal 0.37
share_population_in_metro_areas 1.00
share_population_with_high_school_degree -0.27
share_non_citizen 0.75
share_white_poverty -0.39
gini_index 0.52
share_non_white 0.60
share_voters_voted_trump -0.58
hate_crimes_per_100k_splc 0.26
avg_hatecrimes_per_100k_fbi 0.21
share_population_with_high_school_degree \
median_household_income 0.64
share_unemployed_seasonal -0.61
share_population_in_metro_areas -0.27
share_population_with_high_school_degree 1.00
share_non_citizen -0.30
share_white_poverty -0.48
gini_index -0.58
share_non_white -0.56
share_voters_voted_trump -0.13
hate_crimes_per_100k_splc 0.21
avg_hatecrimes_per_100k_fbi 0.16
share_non_citizen \
median_household_income 0.28
share_unemployed_seasonal 0.31
share_population_in_metro_areas 0.75
share_population_with_high_school_degree -0.30
share_non_citizen 1.00
share_white_poverty -0.38
gini_index 0.51
share_non_white 0.76
share_voters_voted_trump -0.62
hate_crimes_per_100k_splc 0.28
avg_hatecrimes_per_100k_fbi 0.30
share_white_poverty gini_index \
median_household_income -0.82 -0.15
share_unemployed_seasonal 0.19 0.53
share_population_in_metro_areas -0.39 0.52
share_population_with_high_school_degree -0.48 -0.58
share_non_citizen -0.38 0.51
share_white_poverty 1.00 0.01
gini_index 0.01 1.00
share_non_white -0.24 0.59
share_voters_voted_trump 0.54 -0.46
hate_crimes_per_100k_splc -0.26 0.38
avg_hatecrimes_per_100k_fbi -0.26 0.42
share_non_white \
median_household_income -0.00
share_unemployed_seasonal 0.59
share_population_in_metro_areas 0.60
share_population_with_high_school_degree -0.56
share_non_citizen 0.76
share_white_poverty -0.24
gini_index 0.59
share_non_white 1.00
share_voters_voted_trump -0.44
hate_crimes_per_100k_splc 0.12
avg_hatecrimes_per_100k_fbi 0.08
share_voters_voted_trump \
median_household_income -0.57
share_unemployed_seasonal -0.21
share_population_in_metro_areas -0.58
share_population_with_high_school_degree -0.13
share_non_citizen -0.62
share_white_poverty 0.54
gini_index -0.46
share_non_white -0.44
share_voters_voted_trump 1.00
hate_crimes_per_100k_splc -0.69
avg_hatecrimes_per_100k_fbi -0.50
hate_crimes_per_100k_splc \
median_household_income 0.33
share_unemployed_seasonal 0.18
share_population_in_metro_areas 0.26
share_population_with_high_school_degree 0.21
share_non_citizen 0.28
share_white_poverty -0.26
gini_index 0.38
share_non_white 0.12
share_voters_voted_trump -0.69
hate_crimes_per_100k_splc 1.00
avg_hatecrimes_per_100k_fbi 0.68
avg_hatecrimes_per_100k_fbi
median_household_income 0.32
share_unemployed_seasonal 0.07
share_population_in_metro_areas 0.21
share_population_with_high_school_degree 0.16
share_non_citizen 0.30
share_white_poverty -0.26
gini_index 0.42
share_non_white 0.08
share_voters_voted_trump -0.50
hate_crimes_per_100k_splc 0.68
avg_hatecrimes_per_100k_fbi 1.00
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
corrMatrix = df2.corr(numeric_only=True).round(1) # ".round(1)" added so the heatmap is easier to read given the number of variables
sn.heatmap(corrMatrix, annot=True)
plt.show()
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics
x = df2[['median_household_income', 'share_population_with_high_school_degree', 'share_voters_voted_trump']]
y = df2[['avg_hatecrimes_per_100k_fbi']]
#what if we change the y variable
#y = df2[['hate_crimes_per_100k_splc']]
est = LinearRegression(fit_intercept = True)
est.fit(x, y)
print("Coefficients:", est.coef_)
print ("Intercept:", est.intercept_)
model = LinearRegression()
model.fit(x, y)
y_hat = model.predict(x)
print ("MSE:", metrics.mean_squared_error(y, y_hat))
print ("R^2:", metrics.r2_score(y, y_hat))
print ("var:", y.var())

Coefficients: [[-1.63935828e-05  7.65352737e+00 -7.85302986e+00]]
Intercept: [0.49461694]
MSE: 2.1105276140605045
R^2: 0.26736253642536767
var: avg_hatecrimes_per_100k_fbi 2.939516
dtype: float64
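As a sanity check on the regression workflow itself (synthetic, noise-free data with known coefficients, so `LinearRegression` should recover them exactly and R² should be 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Synthetic data: y depends on X with known coefficients and no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

model = LinearRegression().fit(X, y)
print(np.round(model.coef_, 2))                         # -> [ 3. -2.]
print(round(float(model.intercept_), 2))                # -> 1.0
print(round(metrics.r2_score(y, model.predict(X)), 2))  # -> 1.0
```

The much lower R² (~0.27) on the hate-crimes data reflects real noise and omitted factors, not a problem with the fitting code.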